HDL Cholesterol Prediction using Machine Learning

MAS 635 - Machine Learning Methods

Rolando Vargas, Eleniz Espina, Bryce Leister

University of Miami

2026-02-01

Overview

Goal: Predict HDL cholesterol levels

Dataset: NHANES (National Health and Nutrition Examination Survey)

Approach: Machine Learning & Deep Learning

Key Stats:

  • 1,000 training samples
  • 95 features
  • 7 baseline + 2 DL models
  • Stacking ensemble predictions

Background & Motivation

Why HDL Cholesterol?

High-Density Lipoprotein (HDL) - “Good Cholesterol”

  • Cardiovascular Health: Higher HDL = Lower heart disease risk
  • Clinical Importance: Key indicator for patient screening
  • Modifiable: Can be improved through lifestyle changes
  • Predictive Value: Early identification enables intervention

Dataset: ASA South Florida Data Challenge

National Health and Nutrition Examination Survey (NHANES)

  • Source: ASA South Florida Student Data Challenge
  • Features: Demographics, body measurements, diet, lab values
  • Size: 1,000 training, 200 test samples
  • Target: LBDHDD_outcome - Direct HDL cholesterol (mg/dL)

Normal Ranges:

  • Men: 40-60 mg/dL
  • Women: 50-60 mg/dL

Data Exploration

Missing Data Analysis

Target Distribution

Key Observations:

  • Approximately normal distribution
  • Mean: 54.73 mg/dL, Std: 9.01 mg/dL
  • Slight right skew (0.376)

Feature Correlations

Top Correlations with HDL:

  • Body measurements (BMI, waist): Negative
  • Dietary factors: Mixed effects
  • Demographics: Moderate associations

Exploratory Analysis

HDL vs BMI

Important

Strong Negative Correlation: Higher BMI → Lower HDL

HDL by Demographics

Key Findings:

  • Significant differences across sex and age groups
  • Ethnic variations present
  • Smoking status shows clear patterns

HDL vs Waist Circumference

Clinical Insight: Abdominal obesity strongly associated with lower HDL

Methodology

Data Preprocessing Pipeline

  1. Missing Value Imputation
    • Strategy: Median imputation, kept as a safety net
    • The provided dataset has no missing values
  2. Feature Scaling
    • StandardScaler (mean=0, std=1)
    • All 95 numeric features normalized
  3. Feature Types
    • All 95 features are numeric
    • No categorical encoding needed
  4. Train-Validation Split
    • 800 training (80%) / 200 validation (20%)
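The pipeline above can be sketched with scikit-learn. The synthetic matrix and the random seed below are illustrative stand-ins for the actual NHANES data, not the project's configuration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1,000 x 95 NHANES feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 95))
y = rng.normal(loc=54.73, scale=9.01, size=1000)  # HDL-like target

# Median imputation as a safety net, then standardization (mean=0, std=1)
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# 80/20 train-validation split (800 / 200 samples)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = preprocess.fit_transform(X_train)
X_val = preprocess.transform(X_val)  # fit only on training data to avoid leakage
```

Fitting the imputer and scaler on the training split only, then applying them to validation data, prevents information leaking across the split.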

Model Architecture

Baseline Models (7):

  • Linear Regression
  • Ridge Regression
  • Elastic Net
  • Random Forest
  • Gradient Boosting
  • XGBoost
  • CatBoost

Deep Learning (2):

  • Standard Neural Network
    • 4 hidden layers
    • BatchNorm + Dropout
  • Advanced NN
    • Skip connections
    • Residual architecture

Neural Network Architecture

# Standard Tabular Neural Network
Input (95 features)

Dense(256) → BatchNorm → ReLU → Dropout(0.3)

Dense(128) → BatchNorm → ReLU → Dropout(0.3)

Dense(64) → BatchNorm → ReLU → Dropout(0.2)

Dense(32) → BatchNorm → ReLU → Dropout(0.1)

Output (1) - HDL prediction

Regularization:

  • L2 regularization (0.001)
  • Dropout layers
  • Early stopping
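The architecture above can be sketched in Keras. Layer sizes, dropout rates, the L2 factor (0.001), and the Adam learning rate follow the slides; initializers, layer ordering details, and the helper name `build_standard_nn` are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_standard_nn(n_features=95, l2=1e-3):
    """Standard tabular NN: 4 blocks of Dense -> BatchNorm -> ReLU -> Dropout."""
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for units, rate in [(256, 0.3), (128, 0.3), (64, 0.2), (32, 0.1)]:
        x = layers.Dense(units, kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(rate)(x)
    outputs = layers.Dense(1)(x)  # single linear unit: predicted HDL (mg/dL)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
    return model

model = build_standard_nn()
```

Placing BatchNorm before the ReLU activation is one common convention; the slide does not specify the ordering within each block.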

Training Strategy

  • Loss Function: Mean Squared Error (MSE)
  • Optimizer: Adam (lr=0.001)
  • Callbacks:
    • Early stopping (patience=20)
    • Learning rate reduction (factor=0.5)
  • Epochs: Up to 200 (with early stopping)
  • Batch Size: 32
  • Cross-Validation: 5-fold CV for top models
  • Hyperparameter Tuning: Optuna (automated Bayesian search)
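The 5-fold CV step for the top models can be sketched with scikit-learn. The toy data and the single GradientBoostingRegressor below stand in for the preprocessed NHANES matrix and the actual candidate models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; in the project this is the preprocessed NHANES matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=200)

# 5-fold CV as used for the top models; sklearn reports negated RMSE,
# so flip the sign to get per-fold RMSE values
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingRegressor(random_state=42), X, y,
                         cv=cv, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores
print(rmse_per_fold.mean())
```

Averaging RMSE across the five folds gives a more stable estimate of generalization error than a single 80/20 split on a dataset of this size.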

Results

Training History

Observations:

  • Early stopping activated in both architectures
  • NNs underperformed tree-based models (RMSE 6.69–7.10 vs 5.03)
  • Small dataset (n=1,000) limits deep learning effectiveness

Model Comparison

Performance Metrics

Model               RMSE ↓   MAE ↓    R² ↑
Gradient Boosting   5.0312   3.9438   0.6963
CatBoost            5.0388   3.9370   0.6954
XGBoost             5.1887   4.1054   0.6770
Elastic Net         5.8996   4.6086   0.5825
Basic NN            6.6930   5.2168   0.4626
Advanced NN         7.0997   5.5689   0.3953
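The three metrics in the table can be computed with scikit-learn; the toy HDL values below are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([52.0, 61.0, 48.0, 55.0])  # toy HDL values (mg/dL)
y_pred = np.array([50.0, 59.5, 50.0, 54.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
```

RMSE and MAE are in the target's own units (mg/dL), which is why lower is better; R² is unitless, with higher values indicating a better fit.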

Note

Final Approach: Stacking ensemble with Optuna-tuned models and Ridge meta-learner

Stacking Ensemble Strategy

Approach: Out-of-fold stacking with Ridge meta-learner

Base Models:

  • CatBoost (Optuna-tuned)
  • XGBoost (Optuna-tuned)
  • Gradient Boosting
  • Random Forest

Stacking OOF RMSE:

Model       OOF RMSE
XGBoost     4.7031
CatBoost    4.7318
GradBoost   4.8308
Stacked     4.6434

Why Stacking? The Ridge meta-learner finds an optimal weighting of the base models, a 1.27% improvement over the best single model (XGBoost, OOF RMSE 4.7031 → 4.6434)
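The out-of-fold stacking scheme can be sketched as follows. Two plain sklearn regressors stand in for the Optuna-tuned CatBoost/XGBoost/GB/RF quartet, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Base models (stand-ins for the Optuna-tuned base learners)
base_models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
]

# Out-of-fold predictions: each sample is predicted by a model that never saw it,
# so the meta-learner's training matrix is leakage-free
cv = KFold(n_splits=5, shuffle=True, random_state=42)
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=cv) for m in base_models
])

# Ridge meta-learner combines the base predictions into the final estimate
meta = Ridge(alpha=1.0).fit(oof, y)
print(meta.coef_)  # weight assigned to each base model
```

Because the meta-learner is trained only on out-of-fold predictions, its weights reflect how the base models generalize rather than how well they memorize the training set.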

Prediction Distribution

Validation: Test predictions align well with training distribution

Business Insights

Healthcare Applications

  1. Early Risk Detection
    • Identify patients at cardiovascular risk
    • Predict HDL levels without blood tests
    • Use demographic and lifestyle data
  2. Personalized Interventions
    • Target high-risk individuals
    • Tailor lifestyle recommendations
    • Monitor intervention effectiveness
  3. Resource Optimization
    • Prioritize screening resources
    • Focus on high-risk populations
    • Reduce unnecessary testing

Key Predictive Factors

Modifiable Risk Factors:

  • Body Mass Index - Weight management interventions
  • Waist Circumference - Abdominal fat reduction
  • Dietary Habits - Nutrition counseling opportunities
  • Physical Activity - Exercise program recommendations

Non-Modifiable Factors:

  • Age, sex, ethnicity - Risk stratification

Clinical Recommendations

For Healthcare Providers:

  • Integrate predictions into EHR systems
  • Use for preliminary screening
  • Prioritize patients with predicted low HDL
  • Validate predictions with actual tests

For Patients:

  • Focus on modifiable factors
  • Weight management programs
  • Dietary improvements
  • Regular physical activity
  • Smoking cessation

Model Limitations

Warning

Important Considerations:

  • Model trained on noise-perturbed data (privacy)
  • Cannot replace clinical blood tests
  • Predictions are estimates, not diagnoses
  • Should be used as screening tool only
  • Requires validation in clinical settings

Future Work

Potential Improvements

  1. Feature Engineering
    • Interaction terms
    • Polynomial features
    • Domain-specific transformations
  2. Advanced Architectures
    • Attention mechanisms
    • Transformer models for tabular data
    • AutoML approaches
  3. External Validation
    • Test on different populations
    • Temporal validation
    • Cross-institutional studies

Deployment Considerations

  • Web Application: Patient/provider interface
  • API Integration: EHR system integration
  • Mobile App: Point-of-care predictions
  • Monitoring Dashboard: Population health tracking
  • Continuous Learning: Model updates with new data

Conclusion

Summary

  • Successfully predicted HDL cholesterol using ML/DL
  • Stacking ensemble with Optuna tuning achieved best performance (OOF RMSE: 4.6434)
  • Identified key modifiable risk factors (BMI, waist, diet)
  • Demonstrated clinical applicability
  • Provided actionable business insights

Key Takeaways

Technical:

  • Tree-based models excel on tabular data
  • Optuna hyperparameter tuning improved model performance
  • Stacking ensemble outperforms simple averaging

Clinical:

  • Predictive models enable early intervention
  • Focus on modifiable risk factors
  • Data-driven healthcare decision support

Questions?

Contact:

  • Rolando Vargas
  • Eleniz Espina
  • Bryce Leister

GitHub: github.com/rvargasm7/hdl-prediction-project

Resources:

  • Full Jupyter Notebook
  • Code & Visualizations
  • Preprocessed Data
  • Model Documentation

Appendix

References

  1. CDC NHANES Program - National health survey data
  2. ASA South Florida Data Challenge - Competition source
  3. XGBoost, CatBoost Documentation - Model implementations
  4. TensorFlow/Keras - Deep learning framework
  5. Scikit-learn - ML preprocessing and baseline models
  6. Optuna - Hyperparameter optimization framework

Technical Details

Computing Environment:

  • Python 3.10+
  • TensorFlow 2.20+
  • XGBoost 3.1+
  • CatBoost 1.2+
  • Optuna (Bayesian hyperparameter tuning)
  • 1,000 training samples, 95 features

Training Time:

  • Baseline models: ~2 minutes
  • Deep learning: ~5 minutes
  • Optuna tuning: ~15 minutes
  • Stacking ensemble: ~5 minutes

Data Dictionary (Sample)

Variable         Description                Type
LBDHDD_outcome   HDL Cholesterol (mg/dL)    Target
RIAGENDR         Gender (1=M, 2=F)          Categorical
RIDAGEYR         Age (years)                Numeric
BMXBMI           Body Mass Index            Numeric
BMXWAIST         Waist Circumference (cm)   Numeric

Full data dictionary available in repository